Why Chinese Web-as-Corpus is Wacky? Or: How Big Data is Killing Chinese Corpus Linguistics

Author

  • Shu-Kai Hsieh
Abstract

This paper examines and evaluates the current development of the Web-as-Corpus (WaC) paradigm in Chinese corpus linguistics. I argue that the unstable notion of wordhood in Chinese, and the resulting diversity of approaches to implementing word segmentation systems, pose great challenges for those who are keen on building web-scale corpus data. Two lexical measures are proposed to illustrate the issues, and methodological discussions are provided.

Similar resources

“Those Nation Wreckers are Suffering from Inferiority Complex”: The Depiction of Chinese Miners in the Ghanaian Press

This article studies the depiction of Chinese miners in the Ghanaian news website entitled Modern Ghana. A total of 87 articles comprising 43752 words were retrieved. Van Leeuwen’s (2008) theory of the representation of the social actors was utilised to examine the depiction of Chinese miners in the Ghanaian press. In this regard, six applicable tools were used and these include exclusion, role...

Corpora and English Teaching: Retrospect and Prospect

As a whole system of methods and principles of how to apply corpora in language study, corpus linguistics has revolutionized nearly all branches of linguistics. In the wake of this revolution, people began to rethink language pedagogy from a corpus perspective in the early 1990s. However, today, although corpus linguistics has contributed much to English education, difficulties do exist, especially i...

The Jinan Chinese Learner Corpus

We present the Jinan Chinese Learner Corpus, a large collection of L2 Chinese texts produced by learners that can be used for educational tasks. The present work introduces the data and provides a detailed description. Currently, the corpus contains approximately 6 million Chinese characters written by students from over 50 different L1 backgrounds. This is a large-scale corpus of learner Chine...

Chinese Sketch Engine and Mapping Principles: A Corpus-Based Study of Conceptual Metaphors Using the BUILDING Source Domain

The goal of this paper is to use a large-scale corpus, i.e. the Gigaword Corpus via the interface of Chinese Sketch Engine, to determine underlying reasons between source and target domain pairings for conceptual metaphors, called Mapping Principles. In particular, we will employ a frequency-based collocational approach to examine metaphors that use the source domain of BUILDING in Mandarin Chin...

On Bias-free Crawling and Representative Web Corpora

In this paper, I present a specialized open-source crawler that can be used to obtain bias-reduced samples from the web. First, I briefly discuss the relevance of bias-reduced web corpus sampling for corpus linguistics. Then, I summarize theoretical results that show how commonly used crawling methods obtain highly biased samples from the web. The theoretical part of the paper is followed by a d...

Publication date: 2014